Abstract

During the period of June 1999 to June 2000, supported by ARO grant DAAD
19-99-1-0248, we developed a novel nonlinear transformation to process spoken
words in noisy environments, based on human hearing perception and the
properties of a focusing partial differential equation (PDE). The
transformation acts on the short-term Fourier spectra of speech signals. It
was designed to reduce noise through time adaptation, and to enhance spectral
peaks (formants) by evolving a focusing quadratic Cahn-Hilliard equation. Time
adaptation and peak focusing (also known as lateral inhibition) are essential
processing mechanisms in the human cochlea.

Numerical  results  on  noisy  spoken  words  indicated  that  the  transformed  spectral 
pattern  of  the  spoken  words  was  insensitive  to  noise  (signal-to-noise  ratio  (SNR) 
ranging  from  0  to  20  dB).  The  spectral  distances  between  noisy  and  original  words 
decreased after the transformation. A numerical experiment on eleven spoken
words at SNR = 5 dB, for example, reached a recognition rate as high as 100%.
These very encouraging results show the success of our nonlinear
transformation and the need for its further development within our framework.

In  this  final  report,  we  state  the  problem  studied,  summarize  main  results,  and  point 
out  future  directions. 
1  Statement  of  the  Problem  Studied


Consider the spectral data arising from the short-time windowed Fourier
transform of the sound wave of a spoken word. The spectral data, known as a
spectrogram, form a matrix $a_{i,j} = a(t_i, x_j)$, $1 \le i \le I$,
$1 \le j \le J$, where $I$, $J$ are positive integers; the $t_i$'s are
discrete samples of time, typically at a rate of 10-30 ms per frame; and the
$x_j$'s are discrete samples of frequency, typically scaled according to human
auditory perception, or the Bark scale (roughly a logarithmic scale). The
spectrogram is a space-time distribution of acoustic energy and, as usual, we
consider it in units of decibels (dB), that is,
$u_{i,j} = 10\,(\log_{10} |a_{i,j}|^2 - \log_{10} I_0)$, where $I_0$ is a
reference intensity.
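
As an illustration of the decibel conversion described above, the following is
a minimal sketch in Python. The function name `to_decibels`, the default
reference intensity, and the small `floor` used to avoid taking the logarithm
of zero are assumptions made for this example and are not part of the report.

```python
import math

def to_decibels(a, i0=1.0, floor=1e-12):
    """Convert a spectrogram a[i][j] (real or complex amplitudes) to dB.

    Implements u[i][j] = 10 * (log10(|a[i][j]|^2) - log10(i0)).
    `floor` is a hypothetical safeguard against log10(0); the report
    does not specify how silent bins are handled.
    """
    return [
        [10.0 * (math.log10(max(abs(x) ** 2, floor)) - math.log10(i0))
         for x in row]
        for row in a
    ]

# Two frames, two frequency bins each.
u = to_decibels([[1.0, 10.0], [0.1, 1 + 0j]])
```

With the reference intensity set to 1, an amplitude of 10 maps to 20 dB and an
amplitude of 0.1 maps to -20 dB.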

Our processing and the resulting recognition will be on the matrix $u_{i,j}$
for a spoken word from a vocabulary of words under noisy conditions. The
problem studied is: if $u_{i,j}$ is received from a noisy environment, how do
we process it so that the essential speech features are enhanced, noise
effects are reduced, and recognition rates are improved?

It is known that using a noisy spectrogram without processing leads to a
drastic increase in recognition errors. Methods for reducing noise effects
based on auditory perception have been proposed and implemented in the
engineering community; see [4], [11], [12], among others. These methods tend
to rely on various ad hoc procedures to model two mechanisms in human
audition: (1) time adaptation (reducing the redundancy of a slowly varying
spectrum at any fixed frequency), and (2) isolation, tracking, and enhancement
of spectral peaks (formants).

A major portion of our study is the development of a systematic and efficient
mathematical method that captures both time adaptation and formant structure
in a noisy speech spectrum without case-dependent filtering or structure
tracking.


2  Summary  of  Most  Important  Results 

Adaptation reduces any portion of the spectrogram at a fixed frequency where
there is not enough variation in time, a process that occurs in human hearing
to remove redundancy. For a clean signal, the spectral curve is smooth in
time, so one can use the time derivative to measure this variation. Because of
noise, the spectrogram is often rough, and one must devise an alternative
measure of variation. We first divide the frequencies into three bands (low,
mid, and high) and define an average curve for each band as its
representative. This three-band division is based on the human hearing
response to multi-frequency stimuli (critical bandwidth) [10]. We then
construct an upper and a lower envelope for each band representative, so that
the difference of the two envelopes is a good measure of the true signal
variation in time for each band. When the difference is below a threshold
(2 dB), indicating either noise or redundant signal, adaptation takes place.
This is the nonlinear transformation in time; call it $A_d$.
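
The adaptation step can be sketched as follows. This is only an illustration
under stated assumptions: the report does not say how the envelopes are built
or how much the spectrogram is reduced, so the running maximum/minimum
envelopes, the window half-width `half_width`, and the fixed reduction
`attenuation_db` are hypothetical choices. Only the band averaging and the
2 dB threshold come from the text.

```python
def band_average(u, lo, hi):
    """Average the dB spectrogram u[i][j] over bins lo..hi-1, frame by frame."""
    return [sum(row[lo:hi]) / (hi - lo) for row in u]

def envelopes(curve, half_width):
    """Running max and min of a curve over +/- half_width frames (assumed form)."""
    n = len(curve)
    upper, lower = [], []
    for i in range(n):
        window = curve[max(0, i - half_width):min(n, i + half_width + 1)]
        upper.append(max(window))
        lower.append(min(window))
    return upper, lower

def adapt(u, band_edges, threshold_db=2.0, half_width=2, attenuation_db=10.0):
    """Reduce each band in frames where its envelope gap is below the threshold.

    band_edges lists the bin boundaries of the bands, e.g. [0, 8, 16, 24]
    for three bands. The input u is left unchanged; a new matrix is returned.
    """
    out = [list(row) for row in u]
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        upper, lower = envelopes(band_average(u, lo, hi), half_width)
        for i, (up, low) in enumerate(zip(upper, lower)):
            if up - low < threshold_db:  # too little variation in time
                for j in range(lo, hi):
                    out[i][j] -= attenuation_db
    return out
```

In this sketch a band that stays nearly constant over the window is lowered by
`attenuation_db`, while a band whose average changes by 2 dB or more within
the window is passed through unchanged.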

For each fixed time, spectral peaks appear in vowels at isolated frequency
locations, with magnitudes decreasing towards higher frequencies. These
spectral peaks are called formants, and are locations of energy concentration.
Enhancing these peaks at the expense of the energy at neighboring points is
analogous to the well-known lateral inhibition phenomenon in psychoacoustics
[5], [6]. For two tones, one stronger than the other, Houtgast ([5], [6])
showed experimentally that the stronger tone suppresses the nearby weaker
tone, and attributed this to processes within the cochlea and to the nonlinear
aspect of the neural coding of the sound spectrum. The neural projection of
the sound spectrum tends to sharpen formant peaks, which helps vowel
discrimination and recognition. Motivated by this connection, we found and
implemented a nonlinear transformation based on evolving a focusing
fourth-order nonlinear partial differential equation (known as the
Cahn-Hilliard equation [2], [3],

W): 

$$u_\tau = -\left( a\,(u^2)_{xx} + \epsilon\,u_{xxxx} \right) \qquad (1)$$

where $a > 1$ and $\epsilon > 0$ is small enough. Here $\tau$ is the
processing time and $x$ is the frequency. To handle the roughness of the
spectral curve resulting from noise (that is, to minimize noise-induced
spurious peaks), we first take a logarithm of the input (the initial data of
(1)), evolve it under (1) up to a time $\tau = T$, then exponentiate the
solution. The nonlinear mapping, call it $F_0$, is the composition of these
three steps. The focusing step resulting from evolving (1) mimics lateral
inhibition effects. Fixing the parameters $a$, $\epsilon$, and $T$ requires
training on a clean data set. The transformation $F_0$ is applied at times
during a vowel, and hence processes the spectrogram along the frequency axis.
One does not need to know where the peaks are (no tracking is necessary); the
peaks are captured and enhanced automatically by evolving (1). This is the
advantage of our method over the tracking and filtering methods in [11], [12].
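
The three steps that make up $F_0$ (logarithm, evolution, exponentiation) can
be sketched as below. This is not the implementation used for the reported
results. It assumes that equation (1) has the form
$u_\tau = -a\,(u^2)_{xx} - \epsilon\,u_{xxxx}$, and it uses an explicit Euler
step, unit grid spacing, and periodic boundary conditions in frequency. The
function names and all default parameter values are hypothetical; the report
states that $a$, $\epsilon$, and $T$ are fixed by training.

```python
import math

def second_difference(v):
    """Centered second difference with periodic wrap-around (grid spacing 1)."""
    n = len(v)
    return [v[(j - 1) % n] - 2.0 * v[j] + v[(j + 1) % n] for j in range(n)]

def focus(spectrum, a=1.5, eps=0.1, T=0.1, dt=0.01):
    """Sketch of F_0 for one time frame: log, evolve the assumed PDE, exp.

    spectrum must contain positive values along the frequency axis.
    """
    v = [math.log(s) for s in spectrum]              # step 1: logarithm
    for _ in range(int(round(T / dt))):              # step 2: evolve to tau = T
        quad_xx = second_difference([x * x for x in v])       # (v^2)_xx
        v_xxxx = second_difference(second_difference(v))      # v_xxxx
        v = [x - dt * (a * q + eps * f)
             for x, q, f in zip(v, quad_xx, v_xxxx)]
    return [math.exp(x) for x in v]                  # step 3: exponentiate

# A single smooth peak centered at bin 16 on a flat background.
spectrum = [3.0 + 2.0 * math.exp(-((j - 16) / 3.0) ** 2) for j in range(32)]
sharpened = focus(spectrum)
```

For a positive solution the quadratic term acts as a backward diffusion, since
$(u^2)_{xx} = 2\,u\,u_{xx} + 2\,u_x^2$, so a peak grows at the expense of its
neighbors, while the fourth-order term damps the shortest wavelengths. The
explicit scheme is only suitable for short processing times and small time
steps; a practical implementation would likely need a more robust time
discretization.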

The entire nonlinear transformation, call it $N$, is the combination of the
time adaptation transformation $A_d$ and the peak focusing transformation
$F_0$. Our numerical experiments on spoken words, with noise added at
signal-to-noise ratios (SNR) from 0 dB to 5 dB, demonstrated that the
nonlinear transformation is robust and insensitive to noise. Moreover, the
spectral $L^2$ distances between noisy words and original words decreased
after the transformation. A numerical experiment was performed on eleven
spoken words at SNR = 5 dB. A noisy word is recognized numerically by finding
the clean template at the smallest $L^2$ spectral distance from it. The
experiment reached a recognition rate as high as 100%. See [9] for analysis
and justification of the properties of the transformation, and for details of
the reported results.
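
The recognition rule described above, choosing the clean template at the
smallest $L^2$ spectral distance, can be sketched as follows. The function
names and the toy data are hypothetical, and the sketch assumes that the noisy
word and every template have already been transformed and share the same
matrix shape; time alignment of words of different length is not addressed
here.

```python
import math

def l2_distance(u, v):
    """Discrete L2 distance between two spectrograms of identical shape."""
    return math.sqrt(sum((p - q) ** 2
                         for row_u, row_v in zip(u, v)
                         for p, q in zip(row_u, row_v)))

def recognize(noisy, templates):
    """Return the label of the template closest to the noisy word.

    templates maps a word label to its clean (transformed) spectrogram.
    """
    return min(templates, key=lambda word: l2_distance(noisy, templates[word]))

templates = {
    "one": [[0.0, 0.0], [1.0, 1.0]],
    "two": [[5.0, 5.0], [5.0, 5.0]],
}
label = recognize([[0.2, -0.1], [1.1, 0.9]], templates)
```

In this toy example the noisy input lies much closer to the template labeled
"one", so that label is returned.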

In the future, we plan to investigate lateral inhibition effects by seeking a
more neurophysiologically based model ([8], [7]), extending the two-tone (or
near-neighbor) inhibition picture to multitone nonlinear interaction. This
work is currently in progress. The other, less well known but seemingly rather
important, aspect of method design is to find out how to couple adaptation and
lateral inhibition in a nonlinear fashion, so as to better model the
continuous variation of essential speech spectral information in time. We
believe these are fundamental steps towards further improving recognition
performance through a better approximation of human auditory processing.


3  List  of  Publications  and  Reports 
(2) Y-Y Qi, J. Xin, A Perception and PDE Based Nonlinear Transformation for
Processing Spoken Words (abridged current version), in Proceedings of the 6th
International Conference on Spoken Language Processing (ICSLP), Vol. I,
pp. 445-448, 2000.

(3)  ARO  Interim  Report:  A  Perception  and  PDE  Based  Nonlinear  Transformation 